In this dataset, we explore the chemical properties of white wine to understand how each variable relates to quality.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Our dataset contains 4898 observations and 13 variables. To show several similar histograms in a single plot, we reshape the data to long format with the melt method and use facets, which draw one histogram per variable.
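A minimal sketch of that reshape-and-facet step is given here. It assumes the `reshape2` and `ggplot2` packages, which the report does not name explicitly. To stay self-contained, it uses only the first ten rows of three columns as printed in the `str()` preview; the `preview` and `long` names are illustrative.

```r
library(reshape2)  # provides melt() (assumed package)
library(ggplot2)   # provides facet_wrap() (assumed package)

# First ten rows of three columns, copied from the str() preview above;
# the full analysis would use the whole `dataset` data frame instead.
preview <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43)
)

# melt() stacks the columns into long format: one row per (variable, value) pair.
long <- melt(preview, measure.vars = names(preview))

# One histogram per variable; the x scales are free because the ranges differ.
p <- ggplot(long, aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ variable, scales = "free_x")
print(p)
```

Free x scales matter here because the variables span very different ranges; with a shared axis, the narrow ones would collapse into a single bar.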
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The median fixed acidity is 6.8 g/l, and most white wines have a fixed acidity between 5.5 and 8.5 g/l.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The distribution of volatile acidity is slightly right skewed with a median of 0.26 g/l. There are some outliers on the higher end of the scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The distribution of citric acid is approximately normal, with a median of 0.32 g/l and a mean of 0.334 g/l.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual sugar has a median of 5.2 g/l, and its distribution is right skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides have a median of 0.043 g/l. The bulk of the distribution looks approximately normal, although the maximum of 0.346 g/l points to a long right tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Free sulfur dioxide is approximately normally distributed with a slight right skew. The median value is 34 mg/l.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total sulfur dioxide is approximately normally distributed with a slight right skew. The median value is 134 mg/l.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The densities of our white wines fall in a very narrow range: the middle 50% lie between 0.9917 and 0.9961 g/cm³, with a median of 0.9937 g/cm³.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH is approximately normally distributed, with most wines falling between 2.9 and 3.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates are slightly right skewed with a median of 0.47 g/l.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol content is right skewed. It starts at 8%, which may reflect a minimum alcohol level required for wine.
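One rough way to cross-check the skew descriptions is to compare each variable's mean with its median, since a mean above the median hints at a right tail. The sketch below uses only the means and medians printed in the `summary()` output; the `stats` and `rel_gap` names are illustrative, and the heuristic is a quick check, not a formal skewness test.

```r
# Mean and median of each variable, copied from the summary() output above.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  median = c(6.8, 0.26, 0.32, 5.2, 0.043, 34, 134, 0.9937, 3.18, 0.47, 10.4),
  mean   = c(6.855, 0.2782, 0.3342, 6.391, 0.04577, 35.31, 138.4, 0.9940,
             3.188, 0.4898, 10.51)
)

# Relative gap between mean and median; a positive value hints at right skew.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest gaps first.
stats <- stats[order(-stats$rel_gap), ]
print(stats, row.names = FALSE)
```

In this table every variable has a mean at or above its median, and residual sugar shows by far the largest relative gap (about 23%), which agrees with its visibly right-skewed histogram.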
The dataset has 4898 observations of 13 variables, one row per wine. Eleven of the variables measure a chemical property, one records the overall taste quality, and one (X) is a unique observation ID.
The main feature of interest is the quality rating.
Examining the supporting variables should help the investigation, since some are likely to have more impact on quality than others. For example, some variables may correlate more strongly with quality than the rest.
No new variables were created in the dataset.
There were no unusual distributions and no missing values, so no adjustments to the data were needed. The dataset is already clean, which makes it well suited for analysis.
## [1] "median of fixed.acidity by quality:"
## dataset$quality: 3
## [1] 7.3
## --------------------------------------------------------
## dataset$quality: 4
## [1] 6.9
## --------------------------------------------------------
## dataset$quality: 5
## [1] 6.8
## --------------------------------------------------------
## dataset$quality: 6
## [1] 6.8
## --------------------------------------------------------
## dataset$quality: 7
## [1] 6.7
## --------------------------------------------------------
## dataset$quality: 8
## [1] 6.8
## --------------------------------------------------------
## dataset$quality: 9
## [1] 7.1
Median fixed acidity is relatively stable across quality levels, while individual values are widely dispersed within each level. This suggests that other variables contribute more to overall quality.
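The `dataset$quality:` headers in the output suggest that the medians were produced with base R's `by()`. A minimal sketch of that call follows, run on a small toy data frame whose values are made up for illustration; the real analysis would pass the full `dataset` columns instead.

```r
# Toy data frame with made-up values, for illustration only;
# the real analysis uses the full `dataset`.
toy <- data.frame(
  quality       = c(5, 5, 5, 6, 6, 6, 7, 7),
  fixed.acidity = c(6.8, 7.4, 6.5, 6.9, 6.6, 7.0, 6.7, 6.3)
)

# by() splits the first argument by the grouping variable and applies the function.
med_by_quality <- by(toy$fixed.acidity, toy$quality, median)
print(med_by_quality)

# tapply() computes the same medians but returns a named array,
# which is easier to reuse in later calculations or plots.
med_vec <- tapply(toy$fixed.acidity, toy$quality, median)
print(med_vec)
```

The same pattern applies to every other variable by swapping the first argument.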
## [1] "median of volatile.acidity by quality:"
## dataset$quality: 3
## [1] 0.26
## --------------------------------------------------------
## dataset$quality: 4
## [1] 0.32
## --------------------------------------------------------
## dataset$quality: 5
## [1] 0.28
## --------------------------------------------------------
## dataset$quality: 6
## [1] 0.25
## --------------------------------------------------------
## dataset$quality: 7
## [1] 0.25
## --------------------------------------------------------
## dataset$quality: 8
## [1] 0.26
## --------------------------------------------------------
## dataset$quality: 9
## [1] 0.27
The median level of volatile acidity is fairly stable across quality levels. It peaks at 0.32 g/l for quality 4 and settles between 0.25 and 0.27 g/l for ratings of 6 and above.
## [1] "median of citric.acid by quality:"
## dataset$quality: 3
## [1] 0.345
## --------------------------------------------------------
## dataset$quality: 4
## [1] 0.29
## --------------------------------------------------------
## dataset$quality: 5
## [1] 0.32
## --------------------------------------------------------
## dataset$quality: 6
## [1] 0.32
## --------------------------------------------------------
## dataset$quality: 7
## [1] 0.31
## --------------------------------------------------------
## dataset$quality: 8
## [1] 0.32
## --------------------------------------------------------
## dataset$quality: 9
## [1] 0.36
Median citric acid is nearly flat across quality levels, ranging from 0.29 to 0.36 g/l. The highest median occurs at quality 9, but there is no consistent trend.
## [1] "median of residual.sugar by quality:"
## dataset$quality: 3
## [1] 4.6
## --------------------------------------------------------
## dataset$quality: 4
## [1] 2.5
## --------------------------------------------------------
## dataset$quality: 5
## [1] 7
## --------------------------------------------------------
## dataset$quality: 6
## [1] 5.3
## --------------------------------------------------------
## dataset$quality: 7
## [1] 3.65
## --------------------------------------------------------
## dataset$quality: 8
## [1] 4.3
## --------------------------------------------------------
## dataset$quality: 9
## [1] 2.2
Residual sugar shows no consistent pattern relative to quality, so it may have little impact on the quality of wine.
## [1] "median of chlorides by quality:"
## dataset$quality: 3
## [1] 0.041
## --------------------------------------------------------
## dataset$quality: 4
## [1] 0.046
## --------------------------------------------------------
## dataset$quality: 5
## [1] 0.047
## --------------------------------------------------------
## dataset$quality: 6
## [1] 0.043
## --------------------------------------------------------
## dataset$quality: 7
## [1] 0.037
## --------------------------------------------------------
## dataset$quality: 8
## [1] 0.036
## --------------------------------------------------------
## dataset$quality: 9
## [1] 0.031
There is a slight negative relationship: from quality 5 upward, median chlorides fall from 0.047 to 0.031 g/l as quality increases.
## [1] "median of free.sulfur.dioxide by quality:"
## dataset$quality: 3
## [1] 33.5
## --------------------------------------------------------
## dataset$quality: 4
## [1] 18
## --------------------------------------------------------
## dataset$quality: 5
## [1] 35
## --------------------------------------------------------
## dataset$quality: 6
## [1] 34
## --------------------------------------------------------
## dataset$quality: 7
## [1] 33
## --------------------------------------------------------
## dataset$quality: 8
## [1] 35
## --------------------------------------------------------
## dataset$quality: 9
## [1] 28
Median free sulfur dioxide dips sharply at quality 4 (18 mg/l) and is otherwise fairly flat across quality levels.
## [1] "median of total.sulfur.dioxide by quality:"
## dataset$quality: 3
## [1] 159.5
## --------------------------------------------------------
## dataset$quality: 4
## [1] 117
## --------------------------------------------------------
## dataset$quality: 5
## [1] 151
## --------------------------------------------------------
## dataset$quality: 6
## [1] 132
## --------------------------------------------------------
## dataset$quality: 7
## [1] 122
## --------------------------------------------------------
## dataset$quality: 8
## [1] 122
## --------------------------------------------------------
## dataset$quality: 9
## [1] 119
Median total sulfur dioxide jumps at quality 5 (151 mg/l), then declines and flattens out near 120 mg/l for ratings of 7 and above.
## [1] "median of density by quality:"
## dataset$quality: 3
## [1] 0.994425
## --------------------------------------------------------
## dataset$quality: 4
## [1] 0.9941
## --------------------------------------------------------
## dataset$quality: 5
## [1] 0.9953
## --------------------------------------------------------
## dataset$quality: 6
## [1] 0.99366
## --------------------------------------------------------
## dataset$quality: 7
## [1] 0.99176
## --------------------------------------------------------
## dataset$quality: 8
## [1] 0.99164
## --------------------------------------------------------
## dataset$quality: 9
## [1] 0.9903
We notice a pattern with density: from quality 5 upward, median density decreases as quality increases.
## [1] "median of pH by quality:"
## dataset$quality: 3
## [1] 3.215
## --------------------------------------------------------
## dataset$quality: 4
## [1] 3.16
## --------------------------------------------------------
## dataset$quality: 5
## [1] 3.16
## --------------------------------------------------------
## dataset$quality: 6
## [1] 3.18
## --------------------------------------------------------
## dataset$quality: 7
## [1] 3.2
## --------------------------------------------------------
## dataset$quality: 8
## [1] 3.23
## --------------------------------------------------------
## dataset$quality: 9
## [1] 3.28
We notice a slight upward trend in pH: from quality 4 upward, higher quality ratings come with slightly higher median pH.
## [1] "median of sulphates by quality:"
## dataset$quality: 3
## [1] 0.44
## --------------------------------------------------------
## dataset$quality: 4
## [1] 0.47
## --------------------------------------------------------
## dataset$quality: 5
## [1] 0.47
## --------------------------------------------------------
## dataset$quality: 6
## [1] 0.48
## --------------------------------------------------------
## dataset$quality: 7
## [1] 0.48
## --------------------------------------------------------
## dataset$quality: 8
## [1] 0.46
## --------------------------------------------------------
## dataset$quality: 9
## [1] 0.46
Median sulphates are fairly stable across quality levels, staying between 0.44 and 0.48 g/l.
## [1] "median of alcohol by quality:"
## dataset$quality: 3
## [1] 10.45
## --------------------------------------------------------
## dataset$quality: 4
## [1] 10.1
## --------------------------------------------------------
## dataset$quality: 5
## [1] 9.5
## --------------------------------------------------------
## dataset$quality: 6
## [1] 10.5
## --------------------------------------------------------
## dataset$quality: 7
## [1] 11.4
## --------------------------------------------------------
## dataset$quality: 8
## [1] 12
## --------------------------------------------------------
## dataset$quality: 9
## [1] 12.5
Even though there is a dip at quality rating 5, higher alcohol content is associated with a higher quality rating.
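To compare these trends side by side, the per-quality medians printed above can be gathered into one small table. The sketch below does this for alcohol, density, and chlorides and checks whether each median moves in one direction from quality 5 upward; the `medians` and `upper` names are illustrative.

```r
# Per-quality medians copied from the by() output above.
medians <- data.frame(
  quality   = 3:9,
  alcohol   = c(10.45, 10.1, 9.5, 10.5, 11.4, 12, 12.5),
  density   = c(0.994425, 0.9941, 0.9953, 0.99366, 0.99176, 0.99164, 0.9903),
  chlorides = c(0.041, 0.046, 0.047, 0.043, 0.037, 0.036, 0.031)
)

# Restrict to ratings 5 and up, where the trends look monotone.
upper <- medians[medians$quality >= 5, ]

# TRUE when every step to the next rating moves in the stated direction.
alcohol_rises  <- all(diff(upper$alcohol) > 0)
density_falls  <- all(diff(upper$density) < 0)
chlorides_fall <- all(diff(upper$chlorides) < 0)

print(c(alcohol_rises = alcohol_rises,
        density_falls = density_falls,
        chlorides_fall = chlorides_fall))
```

All three checks hold from quality 5 upward, while none of the three series is monotone over the full 3 to 9 range. The medians alone do not show how many wines sit behind each rating, so the extremes deserve caution.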
Fixed acidity has a negative relationship with pH: as fixed acidity declines, pH increases. Of the three acidity measures, fixed acidity is the one most strongly associated with pH.
When we compare volatile acidity to pH, the points are concentrated between 0.1 and 0.45 g/l of volatile acidity and between pH 2.9 and 3.5.
With citric acid, as with volatile acidity, the points are concentrated in one region, between 0.1 and 0.6 g/l of citric acid and between pH 2.8 and 3.5.
##
## Pearson's product-moment correlation
##
## data: pH and log10(fixed.acidity)
## t = -33.783, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4572280 -0.4117972
## sample estimates:
## cor
## -0.4347892
##
## Pearson's product-moment correlation
##
## data: pH and log10(volatile.acidity)
## t = -3.7719, df = 4896, p-value = 0.0001639
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08171127 -0.02586052
## sample estimates:
## cor
## -0.05382799
##
## Pearson's product-moment correlation
##
## data: pH and citric.acid
## t = -11.614, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1908793 -0.1363671
## sample estimates:
## cor
## -0.1637482
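The test statistics above follow directly from the correlation estimate and the degrees of freedom. As a worked check, the sketch below recomputes the t statistic and the 95% confidence interval for the first test (pH and `log10(fixed.acidity)`) from the printed `cor` and `df` values, using the standard t formula and the Fisher z-transformation; the variable names are illustrative.

```r
# Correlation estimate and degrees of freedom from the first test above
# (pH and log10(fixed.acidity)).
r  <- -0.4347892
df <- 4896
n  <- df + 2  # number of observations

# t statistic for the null hypothesis that the true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

The results agree with the reported t = -33.783 and the interval from -0.457 to -0.412. With nearly 4900 observations, even the small correlation between pH and volatile acidity (about -0.05) is statistically significant, so the size of the coefficient says more here than the p-value.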
We see that density and chlorides are concentrated in one area.
We notice that alcohol percentage decreases as density increases marginally.
As alcohol percentage decreases, chlorides increase marginally.
There is no discernible relationship: as sulphate level changes, free sulfur dioxide remains relatively stable.
There is no discernible relationship: as sulphate level changes, total sulfur dioxide remains relatively stable.
There is no discernible relationship: as sulphate level changes, chlorides remain relatively stable.
## [,1]
## X 0.04199914
## fixed.acidity -0.08448545
## volatile.acidity -0.19656168
## citric.acid 0.01833273
## residual.sugar -0.08206979
## chlorides -0.31448848
## free.sulfur.dioxide 0.02371338
## total.sulfur.dioxide -0.19668029
## density -0.34835102
## pH 0.10936208
## sulphates 0.03331897
## alcohol 0.44036918
Overall, white wine quality correlates most strongly with alcohol (0.44), density (-0.35), chlorides (-0.31), total sulfur dioxide (-0.20), and volatile acidity (-0.20).
It was interesting to see a moderate negative relationship between fixed acidity and pH (r ≈ -0.43). Perhaps this is due to its higher concentration relative to the other acids.
The strongest relationship we found with quality is with alcohol percentage.
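Ranking the coefficients by absolute value makes that comparison explicit. The sketch below copies the values from the correlation output above and sorts them by strength; the `cor_quality` and `ranked` names are illustrative, and the commented `cor()` call is only an assumption about how such a column could be produced.

```r
# Correlation of each variable with quality, copied from the output above.
# (Assumption: a column like this could come from something such as
#  cor(dataset[, names(dataset) != "quality"], dataset$quality).)
cor_quality <- c(
  X = 0.04199914, fixed.acidity = -0.08448545, volatile.acidity = -0.19656168,
  citric.acid = 0.01833273, residual.sugar = -0.08206979, chlorides = -0.31448848,
  free.sulfur.dioxide = 0.02371338, total.sulfur.dioxide = -0.19668029,
  density = -0.34835102, pH = 0.10936208, sulphates = 0.03331897,
  alcohol = 0.44036918
)

# Sort by strength regardless of sign.
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(round(head(ranked, 5), 3))
```

Sorting on the absolute value keeps strong negative relationships such as density and chlorides near the top, where a plain sort would push them to the bottom.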
We notice that wine quality increases as volatile acidity decreases and alcohol increases.
This plot shows no clear pattern: high-quality wines appear at many different pH values and across different ranges of fixed acidity.
We notice that wine quality increases as chlorides decrease and alcohol increases.
Total sulfur dioxide seems to have little to no effect on quality: higher quality is associated with higher alcohol percentage regardless of the total sulfur dioxide level.
##
## Pearson's product-moment correlation
##
## data: alcohol and sulphates
## t = -1.22, df = 4896, p-value = 0.2225
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04541705 0.01057885
## sample estimates:
## cor
## -0.01743277
The main part of our investigation was to examine the features that had the highest correlation with quality.
In our plot we see how alcohol and volatile acidity relate to quality ratings. Higher alcohol and lower volatile acidity tend to be associated with better quality wine.
This plot shows the distribution of volatile acidity across the range of quality ratings. The boxplot shows the minimum, first quartile, median, third quartile, and maximum value. The dots show how the wines are distributed within each category: they are concentrated around the middle quality ratings and are sparse at the lowest and highest ratings. The red line running across the boxplots helps visualize the trend between volatile acidity and quality rating. We see that as volatile acidity declines, the quality rating increases.
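A plot of this kind could be layered as follows. This is a minimal sketch assuming `ggplot2`; the report does not say how the red line was drawn, so the line through the per-rating medians is an assumption, and the toy values are made up for illustration.

```r
library(ggplot2)  # assumed plotting package

# Toy data with made-up values, for illustration only;
# the real plot would use the full `dataset`.
toy <- data.frame(
  quality          = rep(c(4, 5, 6, 7), each = 5),
  volatile.acidity = c(0.40, 0.32, 0.36, 0.28, 0.30,
                       0.30, 0.28, 0.26, 0.34, 0.24,
                       0.26, 0.22, 0.25, 0.30, 0.24,
                       0.25, 0.20, 0.27, 0.23, 0.26)
)

p <- ggplot(toy, aes(x = factor(quality), y = volatile.acidity)) +
  geom_jitter(width = 0.2, alpha = 0.3) +          # individual wines as dots
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +  # quartiles per rating
  stat_summary(aes(group = 1), fun = median,       # line through the medians (assumed)
               geom = "line", colour = "red") +
  labs(x = "Quality rating", y = "Volatile acidity (g/l)",
       title = "Volatile acidity by quality rating")
print(p)
```

Treating quality as a factor gives one box per rating, and `group = 1` lets the summary line connect the medians across those boxes.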
Alcohol is clearly associated with wine quality. Even though the boxplot shows alcohol declining from quality rating 3 to 5, it rises strongly from 5 onward.
Volatile acidity and alcohol are both correlated with quality, and comparing them shows how together they relate to the quality rating. On the plot, lower quality wines have high volatile acidity and low alcohol; moving to the right of the graph, lower volatile acidity and higher alcohol are associated with higher wine quality.
This project was a great opportunity to apply some of the R skills we learnt earlier in the lessons. It was a great way to explore the different functions and plotting features that R offers.
One of the challenges was selecting a meaningful variable to dig deeper into and build the analysis around. Selecting supporting variables that are highly correlated with the main variable was also challenging, and insightful at the same time.
Because R is such a powerful tool, it made exploring the data much more effective: it helps us see trends and draw meaningful insights.
With this project, we were able to identify some of the trends in the data. As future work, we could build prediction models and see how well these trends predict wine quality from the chemical variables.